GH-3574: parquet-hadoop: Statistics.toParquetStatistics: always set null_count by mdibaiee · Pull Request #3575 · apache/parquet-java

mdibaiee · 2026-05-22T11:21:53Z

Rationale for this change

Missing null_count statistics for columns in Parquet files can cause issues for downstream consumers of these files. There is no need to omit this statistic for columns whose values exceed the configured truncation length: the null count does not depend on the size of the values, so nullability can still be asserted with confidence. It remains reasonable to keep omitting min/max statistics in that case, for the reason explained in the code comment.

What changes are included in this PR?

Always add null_count statistics for columns in Parquet files, regardless of their size.
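As a rough illustration of the behavior change (not the actual patch), here is a minimal self-contained sketch. `NullCountSketch`, `FormatStats`, `withinLimit`, `before`, and `after` are hypothetical names made up for this example, and the size check is an assumed simplification; the real condition and method names in ParquetMetadataConverter may differ.

```java
// Hypothetical, simplified model of the conversion; not parquet-java code.
final class NullCountSketch {

    /** Stand-in for the format-level statistics; a null field models "not written". */
    static final class FormatStats {
        Long nullCount;
        byte[] min;
        byte[] max;
    }

    // Assumed size check: min and max together must fit within the limit.
    static boolean withinLimit(byte[] min, byte[] max, int limit) {
        return (long) min.length + max.length <= limit;
    }

    // Old behavior as described in this PR: oversized stats drop null_count too.
    static FormatStats before(long numNulls, byte[] min, byte[] max, int limit) {
        FormatStats out = new FormatStats();
        if (withinLimit(min, max, limit)) {
            out.nullCount = numNulls;
            out.min = min;
            out.max = max;
        }
        return out;
    }

    // New behavior: null_count is always set; only min/max depend on the size check.
    static FormatStats after(long numNulls, byte[] min, byte[] max, int limit) {
        FormatStats out = new FormatStats();
        out.nullCount = numNulls;
        if (withinLimit(min, max, limit)) {
            out.min = min;
            out.max = max;
        }
        return out;
    }
}
```

With oversized values, `before` leaves every field unset, while `after` still reports the null count and keeps min/max omitted.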

Are these changes tested?

Yes, TestParquetMetadataConverter.java has been updated to reflect these changes.

Are there any user-facing changes?

I think the null_count statistic now appearing where it was previously omitted can be considered a user-facing change.

Closes #3574

mdibaiee · 2026-06-02T14:47:08Z

@wgtmac hey, thanks for the initial review. Any chance of getting the workflows approved so that the checks can run and we get closer to merging?

If there is anything we can do on our side to expedite this, please let us know. We have customers blocked on this and are happy to help as much as possible. Appreciate your time 🙇🏽

wgtmac reviewed May 23, 2026

Comment thread parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java Outdated

parquet-hadoop: Statistics.toParquetStatistics: always set null_count

b6d0e5b

mdibaiee force-pushed the metadata-null-counts-truncated branch from 83b6461 to b6d0e5b Compare May 23, 2026 12:15

mdibaiee mentioned this pull request May 27, 2026

materialize-iceberg: nullable in fieldConfig estuary/connectors#4544

Draft

mdibaiee wants to merge 1 commit into apache:master from mdibaiee:metadata-null-counts-truncated

2 participants